Introduction

Apache Spark is an open-source engine for large-scale parallel data processing known for its speed, ease of use, and cutting-edge analytics.

  • It provides high-level APIs in general-purpose programming languages such as Scala, Python, and R, as well as an optimized engine that supports general execution graphs.
  • It also ships with built-in libraries that fulfill users’ needs, including Spark SQL for DataFrames, the pandas API on Spark for pandas workloads, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for stream processing.

Azure Databricks is an analytics platform built on Microsoft Azure cloud services that provides the latest versions of Apache Spark and integrates seamlessly with open-source libraries. Built around these Spark capabilities, Azure Databricks offers a cloud platform with interactive workspaces and fully managed Spark clusters.

  • Such capabilities allow users to create and configure clusters in seconds, gain instant access to Apache Spark capabilities, and build quickly with Azure’s global scale and availability.
  • The analytics platform includes an easy setup, streamlined workflows, and an integrated workspace, allowing users to work in a single, easy-to-use environment and to schedule their Spark code as jobs.

 

Data Description

Twitter (Elon Musk 2015-2022): A dataset of Elon Musk’s most recent tweets from 2015 to 2022, stored in JSON format, with each tweet in its own separate JSON object. All tweets are collected, parsed, and plotted using rtweet in R. In total, the dataset contains several thousand tweets, including retweets and replies. All objects are loaded into a single database.
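Since each tweet sits in its own JSON object, the raw file can be treated as line-delimited JSON. A minimal Python sketch of parsing such a file (the field names follow the dataset variables listed below, but the values here are made up):

```python
import json

# Two illustrative line-delimited JSON records; field names follow the
# dataset's variables, but the values are made up.
raw_lines = [
    '{"status_id": "x1", "screen_name": "elonmusk", "text": "Starship update", "is_retweet": false}',
    '{"status_id": "x2", "screen_name": "elonmusk", "text": "V true", "is_retweet": true}',
]

# Parse each line into a dict, then separate original tweets from retweets
tweets = [json.loads(line) for line in raw_lines]
originals = [t for t in tweets if not t["is_retweet"]]
print(len(tweets), len(originals))  # 2 1
```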

Data Set Variables
 1 status_id            | 14 hashtags              | 27 quoted_followers_count  | 40 retweet_location
 2 created_at           | 15 symbols               | 28 quoted_location         | 41 retweet_description
 3 user_id              | 16 media_expanded_url    | 29 quoted_description      | 42 retweet_verified
 4 screen_name          | 17 media_type            | 30 quoted_verified         | 43 name
 5 text                 | 18 mentions_screen_name  | 31 retweet_status_id       | 44 location
 6 source               | 19 quoted_status_id      | 32 retweet_text            | 45 description
 7 reply_to_screen_name | 20 quoted_text           | 33 retweet_created_at      | 46 followers_count
 8 is_quote             | 21 quoted_created_at     | 34 retweet_source          | 47 friends_count
 9 is_retweet           | 22 quoted_source         | 35 retweet_favorite_count  | 48 account_created_at
10 favorite_count       | 23 quoted_favorite_count | 36 retweet_retweet_count   | 49 verified
11 retweet_count        | 24 quoted_retweet_count  | 37 retweet_user_id         |
12 quote_count          | 25 quoted_user_id        | 38 retweet_screen_name     |
13 reply_count          | 26 quoted_screen_name    | 39 retweet_followers_count |

Twitter API

Once you have your Twitter app set up and the rtweet package loaded in R, you can run the following commands to generate a token for the Twitter data. The first thing to set up in your code is authentication: create a token that authenticates access to tweets.

library(rtweet)

twitter_token <- create_token(
  app = "AppName",
  consumer_key = "<api_key>",
  consumer_secret = "<api_key_secret>",
  access_token = "<access_token>",
  access_secret = "<access_token_secret>"
)

use_oauth_token(twitter_token)

Now we search Twitter’s full-archive API. We run the search_fullarchive command to access a user’s historical tweets. The example below captures all of Elon Musk’s tweets from January 01, 2010 to May 28, 2022.

users <- "elonmusk"
musktweets <- search_fullarchive(
  paste0("from:", users),
  n = <number_of_tweets>,
  env_name = "<environment-name>",
  fromDate = "<YYYYMMDDHHMM>",
  toDate = "<YYYYMMDDHHMM>"
)
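Note that fromDate and toDate take minute-precision UTC timestamps in YYYYMMDDHHMM form. For illustration, the two bounds used above can be produced like this (shown in Python; in R, format() with the same "%Y%m%d%H%M" pattern does the job):

```python
from datetime import datetime

# search_fullarchive expects minute-precision timestamps: YYYYMMDDHHMM
FMT = "%Y%m%d%H%M"
from_date = datetime(2010, 1, 1, 0, 0).strftime(FMT)
to_date = datetime(2022, 5, 28, 0, 0).strftime(FMT)
print(from_date, to_date)  # 201001010000 202205280000
```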

Data Preview:

created_at | screen_name | text | source | favorite_count | retweet_count | quote_count | reply_count | hashtags | media_type | is_quote | is_retweet
2022-05-27 22:52:53 | elonmusk | Starship 24 rolls out to the pad at Starbase https://t.co/PGh6FY6x8w | Twitter for iPhone | 0 | 0 | 0 | 0 | NA | photo | FALSE | TRUE
2022-05-27 22:16:31 | elonmusk | @westcoastbill It was 96% in favor of me, until I commented. Ironically, there might be more left-wing bots following me than right-wing haha. | Twitter for iPhone | 23603 | 1177 | 83 | 1466 | NA | NA | FALSE | FALSE
2022-05-27 22:03:30 | elonmusk | @IheartTesla @Tesla V true | Twitter for iPhone | 1869 | 92 | 7 | 152 | NA | NA | FALSE | FALSE
2022-05-27 20:59:20 | elonmusk | @EvaFoxU Yes | Twitter for iPhone | 3014 | 142 | 9 | 236 | NA | NA | FALSE | FALSE
2022-05-27 20:56:05 | elonmusk | Great potential to lift people out of poverty. Providing Internet is teaching people to fish. | Twitter for iPhone | 140136 | 11811 | 912 | 6232 | NA | NA | FALSE | FALSE
2022-05-27 20:54:04 | elonmusk | One Starlink can provide Internet for an entire school of hundreds of students | Twitter for iPhone | 202116 | 16217 | 1316 | 21341 | NA | NA | FALSE | FALSE

Questions of Interest

  1. (path finding) Display the thread of reply tweets (the tweet text, time, id, in-reply-to id, and user name with screen name) posted by Elon Musk, in the order in which they were posted.
  2. (location) From which location have the tweets been most actively posted (most number of tweets)?
  3. (hashtags) Which hashtags does Musk use the most, and how many tweets are associated with these hashtags?
  4. (topics) What word does Musk mention the most in his tweets? What company products does Musk mention the most in his tweets? Products include Falcon 9, Starlink Satellites, Model 3 cars, etc.
  5. (trending) Are there any trends of what Musk tweets about the company?
  6. (nature of engagement) What percentage of the overall number of tweets does each tweet type (simple tweet, reply, retweet, quoted tweet) represent?

Storing Data in Databricks

Building a Databricks workspace using an Apache Spark cluster.

First, we create a Databricks workspace from the Azure portal and then launch the workspace, which redirects us to the interactive Databricks portal. We create a Spark cluster from the Databricks interactive workspace and configure a notebook on the cluster. In the notebook, we use PySpark to read data from a dataset into a Spark DataFrame. Using the Spark DataFrame, we can run a Spark SQL job to query the data.

The Databricks File System (DBFS) is a distributed file system mounted into an Azure Databricks workspace and available on Databricks clusters. FileStore is a special folder within DBFS where users can upload and store data files from their local computer through the web browser. The Spark API can then read the imported data into an Apache Spark DataFrame. For example, we can import a CSV file from our local machine into a table using the Create Table UI in the Azure Databricks workspace portal.

The Azure Databricks system stores the uploaded data file in FileStore, under the /FileStore/tables folder. We can then use the following PySpark command to read the data into a Spark DataFrame and register it as a temporary view for SQL queries.

twitter_df = spark.read.csv(path = 'dbfs:/FileStore/dfclean.csv',
                            header = "true", multiLine = "true")

# Register the DataFrame as a temporary view so it is accessible via the SQL context
twitter_df.createOrReplaceTempView('twitter_df')


Querying Twitter Data

Users can perform relational procedures on DataFrames using a domain-specific language (DSL) similar to R dataframes and Python Pandas. DataFrames support standard relational operators, including projection (select), filter (where), join, and aggregations (group by).
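To make the semantics of these operators concrete, here is a rough pure-Python analogue over a small made-up sample (Spark, of course, evaluates the same operations lazily and distributes them across the cluster):

```python
from collections import Counter

# Made-up sample rows standing in for a tweets DataFrame
tweets = [
    {"screen_name": "elonmusk", "source": "Twitter for iPhone", "favorite_count": 3014},
    {"screen_name": "elonmusk", "source": "Twitter Web App",    "favorite_count": 12},
    {"screen_name": "elonmusk", "source": "Twitter for iPhone", "favorite_count": 140136},
]

# projection (select) + filter (where): keep only popular tweets
popular = [
    {"source": t["source"], "favorite_count": t["favorite_count"]}
    for t in tweets
    if t["favorite_count"] > 100
]

# aggregation (group by source, then count)
tweets_per_source = Counter(t["source"] for t in tweets)
print(popular, tweets_per_source)
```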

user_id | screen_name | name | description | followers_count | friends_count | favourites_count | account_created_at | verified
x44196397 | elonmusk | Elon Musk | NA | 95589999 | 114 | 13386 | 2009-06-02 20:12:29 | TRUE

Query 1. (mentions)

Display the thread of tweets that mention other users (the tweet text, time, mentioned user id, and mentioned user screen name) posted by Elon Musk, in the order in which they were posted.

from pyspark.sql.functions import col

mentionDF = spark.sql(
  "SELECT screen_name, \
          created_at, \
          text, \
          mentions_user_id, \
          mentions_screen_name \
  FROM twitter_df"
)

mentionDF = mentionDF.filter(col("mentions_user_id") != "NA")

# Persist the filtered result as a table and register it as a temp view
mentionDF.write \
    .format('csv') \
    .mode('overwrite') \
    .saveAsTable("mentionDF")

mentionDF.createOrReplaceTempView('mentionDF')
display(spark.read.table("mentionDF"))
screen_name | text | created_at | mentions_user_id | mentions_screen_name
elonmusk | Yeah, this really is me, as my Mom @mayemusk will attest. Not sure I can handle just doing 140 char missives. Will put longer thoughts on G+ | 2011-12-21 11:13:59 | x14171401 | mayemusk
elonmusk | Vending Machine Attendant Admits B3 Selection Has Changed A Lot Over The Years http://t.co/nccSGzCQ #OnionInnovation | 2011-12-27 19:54:15 | x14075928 | TheOnion
elonmusk | Any Idiot Could Have Come Up With The Car http://t.co/e9cLgfEg #OnionInnovation | 2011-12-28 22:29:48 | x14075928 | TheOnion
elonmusk | @TheOnion So true :) | 2011-12-28 22:27:08 | x14075928 | TheOnion

SQL:

SELECT mentions_screen_name,
       COUNT(*) AS n
FROM mentionDF
WHERE mentions_screen_name != 'NA'
GROUP BY mentions_screen_name
ORDER BY n DESC;

Graphic Report 1.


Query 2. (types)

What percentage of the overall number of tweets does each tweet type (simple tweet, reply, retweet, quoted tweet) represent?

Create Managed Tables

When you create a managed table, Spark manages both the table data and the metadata; in particular, the data is written to the default Hive warehouse location. We want to create a table for each tweet type. Below we visualize the different tweet types and their frequencies in Elon Musk’s tweets.

CREATE TABLE tweet_type (
    -- Spark SQL has no AUTO_INCREMENT or PRIMARY KEY constraints;
    -- status_id serves as the natural key
    status_id STRING,
    tweet_variant STRING,
    user_name STRING,
    created_at TIMESTAMP,
    tweet_text TEXT
);
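The tweet_variant value for each row can be derived from the is_retweet and is_quote flags together with reply_to_screen_name. The rule below is a plain-Python sketch of that classification (the precedence order, retweet before quote before reply, is our own assumption):

```python
from collections import Counter

def tweet_variant(row):
    # Retweet wins over quote; a reply counts only when neither flag is set
    if row["is_retweet"] == "TRUE":
        return "retweet"
    if row["is_quote"] == "TRUE":
        return "quoted"
    if row["reply_to_screen_name"] != "NA":
        return "reply_to"
    return "normal"

# Made-up sample rows, one per variant
sample = [
    {"is_retweet": "TRUE",  "is_quote": "FALSE", "reply_to_screen_name": "NA"},
    {"is_retweet": "FALSE", "is_quote": "TRUE",  "reply_to_screen_name": "NA"},
    {"is_retweet": "FALSE", "is_quote": "FALSE", "reply_to_screen_name": "EvaFoxU"},
    {"is_retweet": "FALSE", "is_quote": "FALSE", "reply_to_screen_name": "NA"},
]
counts = Counter(tweet_variant(r) for r in sample)
percentages = {k: 100 * v / len(sample) for k, v in counts.items()}
print(percentages)  # 25.0 for each variant
```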

So, the different tweet types are retweet, quoted, reply_to, and normal. Now, let’s create a table based on Elon Musk’s retweets.

CREATE TABLE data_retweets (
  retweet_status_id STRING,
  retweet_text STRING,
  retweet_created_at TIMESTAMP,
  retweet_source STRING,
  retweet_favorite_count INT,
  retweet_retweet_count INT,
  retweet_user_id STRING,
  retweet_screen_name STRING,
  retweet_name STRING,
  retweet_followers_count INT,
  retweet_friends_count INT,
  retweet_statuses_count INT,
  retweet_location STRING,
  retweet_description STRING,
  retweet_verified BOOLEAN
);

retweet_status_id | retweet_text | retweet_created_at | retweet_source | retweet_favorite_count | retweet_retweet_count | retweet_user_id | retweet_screen_name | retweet_name | retweet_followers_count | retweet_friends_count | retweet_statuses_count | retweet_location | retweet_description | retweet_verified
x151742906267467777 | Vending Machine Attendant Admits B3 Selection Has Changed A Lot Over The Years http://t.co/nccSGzCQ #OnionInnovation | 2011-12-27 19:15:02 | Hootsuite | 22 | 120 | x14075928 | TheOnion | The Onion | 11836167 | 7 | 89149 | NA | America’s Finest News Source. | TRUE
x152151847614943233 | Any Idiot Could Have Come Up With The Car http://t.co/e9cLgfEg #OnionInnovation | 2011-12-28 22:20:01 | Hootsuite | 32 | 51 | x14075928 | TheOnion | The Onion | 11836167 | 7 | 89149 | NA | America’s Finest News Source. | TRUE

SELECT retweet_screen_name,
       COUNT(retweet_status_id) AS cnt
FROM data_retweets
GROUP BY retweet_screen_name
ORDER BY cnt DESC;


Query 4. (hashtags)

Which hashtags does Musk use the most, and how many tweets are associated with these hashtags?

Here we want to extract hashtags from the content of the tweets. The following command groups the tweets by their hashtags value and counts how many tweets are associated with each hashtag.

from pyspark.sql.functions import col

hashtagsDF = spark.sql("SELECT \
  hashtags, \
  COUNT(*) AS n \
  FROM twitter_df \
  GROUP BY hashtags")

hashtagsDF = hashtagsDF.filter(col("hashtags") != "NA")

# Persist the result as a table and register it as a temp view
hashtagsDF.write \
     .format('csv') \
     .mode('overwrite') \
     .saveAsTable("hashtagsDF")

hashtagsDF.createOrReplaceTempView('hashtagsDF')
display(spark.read.table("hashtagsDF"))
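In the exported CSV, a tweet's hashtags cell holds either "NA" or the tag names; we assume here that multiple tags are flattened into one space-separated string. Under that assumption, a pure-Python sketch of the same count:

```python
from collections import Counter

# Illustrative hashtags column values after the CSV export
hashtag_cells = ["OnionInnovation", "NA", "OnionInnovation SpaceX", "NA", "SpaceX", "Starlink"]

# Skip the NA placeholder rows, split multi-tag cells, and tally each tag
hashtag_counts = Counter(
    tag
    for cell in hashtag_cells
    if cell != "NA"
    for tag in cell.split()
)
print(hashtag_counts.most_common(2))
```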

Graphic Report 4.


Query 5. (topics)

What word does Musk mention the most in his tweets? What company products does Musk mention the most in his tweets?

import pyspark.sql.functions as f
data_df = twitter_df.select('text', 'status_id', 'created_at')

# Count word frequencies in the text column, splitting on spaces

data_df.withColumn('word', f.explode(f.split(f.col('text'), ' '))) \
  .groupBy('word') \
  .count() \
  .sort('count', ascending=False) \
  .show()
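The word count above answers the first half of the question; for product mentions, a case-insensitive substring scan over the tweet text is enough. A sketch (the product list comes from the question; the third sample text is our own invention):

```python
# Product names from the question of interest
products = ["Falcon 9", "Starlink", "Model 3"]

# First two texts come from the data preview; the third is made up
texts = [
    "One Starlink can provide Internet for an entire school of hundreds of students",
    "Starship 24 rolls out to the pad at Starbase",
    "Starlink now serves remote schools",
]

# Case-insensitive substring match, counting tweets that mention each product
mentions = {
    p: sum(p.lower() in t.lower() for t in texts)
    for p in products
}
print(mentions)  # {'Falcon 9': 0, 'Starlink': 2, 'Model 3': 0}
```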

Graphic Report 5.


Sentiment Analysis


References

Brown, Leif, Jason Howell, and Mary McCready. 2022. “Quickstart - Run a Spark Job on Azure Databricks Workspace Using Azure Portal.” Microsoft Technical Documentation. https://docs.microsoft.com/en-us/azure/databricks/scenarios/quickstart-create-databricks-workspace-portal.
Kearney, Michael W. 2019. “Rtweet-Workshop.” Data Science and Analytics Presentation. University of Missouri School of Journalism: Informatics Institute. https://rtweet-workshop.mikewk.com/.
Kearney, Michael W., Francois Briatte, and Andrew Heiss. 2019. “Rtweet: Collecting and Analyzing Twitter Data.” Journal of Open Source Software 4 (42): 1829. https://doi.org/10.21105/joss.01829.